
server,cli: Support Init Command #16371

Merged
3 commits merged on Jul 27, 2017

Conversation

adamgee
Contributor

@adamgee adamgee commented Jun 6, 2017

@cockroach-teamcity
Member

This change is Reviewable

@bdarnell bdarnell changed the title from "Support Init Command" to "server: Support Init Command" on Jun 7, 2017
@bdarnell bdarnell changed the title from "server: Support Init Command" to "server,cli: Support Init Command" on Jun 7, 2017
@adamgee adamgee force-pushed the 5974_revisit_cluster_init branch 2 times, most recently from e528993 to 809b910 on June 9, 2017 18:55
@bdarnell
Contributor

Reviewed 2 of 2 files at r1, 6 of 6 files at r7.
Review status: all files reviewed at latest revision, 11 unresolved discussions, some commit checks failed.


pkg/cli/flags.go, line 277 at r7 (raw file):

	}

	varFlag(initCmd.Flags(), &serverCfg.JoinList, cliflags.Join)

Why? I think cockroach init should take --host like a client command, not --join like cockroach start.


pkg/cli/init.go, line 24 at r7 (raw file):

	"github.com/spf13/cobra"

	//	"github.com/cockroachdb/cockroach/pkg/server"

Delete before merging.


pkg/cli/init.go, line 47 at r7 (raw file):

	request := &serverpb.BootstrapRequest{}
	fmt.Println("Request:", request)
	log.Info(ctx, "Request: ", request)

We probably want to remove these prints and logs before merging.


pkg/server/init.go, line 33 at r7 (raw file):

	server       *Server
	bootstrapped chan struct{}
	mux          sync.Mutex

s/mux/mu/

If the mutex only protects waiting, we like to use this anonymous struct pattern:

mu struct {
    sync.Mutex
    waiting bool
}

But if it's held for the entire duration of Bootstrap, it might be better not to do that to highlight the fact that it protects more than just the variables it contains.
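For illustration, here is a minimal standalone sketch of that pattern; the field set and the setWaiting helper are hypothetical, not taken from this PR:

package example

import "sync"

type initServer struct {
	bootstrapped chan struct{}

	mu struct {
		sync.Mutex
		// waiting is the only field the mutex guards in this sketch.
		waiting bool
	}
}

func (s *initServer) setWaiting(w bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.mu.waiting = w
}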


pkg/server/init.go, line 42 at r7 (raw file):

func (s *initServer) serve(ctx context.Context, ln net.Listener) {
	grpcServer := s.server.grpc

If we're sharing the grpcServer with the rest of the Server, how can we be confident that the other services haven't been registered yet? The initServer could also remain after the initialization phase, which I'd prefer to avoid.

Can we pass in an RPCContext and make a new grpc.Server instead?
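As a rough sketch of what that could look like, assuming the initServer keeps its existing fields and that the generated registration function follows the usual protoc naming for a service called Init (RegisterInitServer); none of the wiring below is taken from the PR itself:

func (s *initServer) serve(ctx context.Context, ln net.Listener) {
	// A grpc.Server dedicated to the Init service, independent of the
	// Server's main gRPC server, so nothing else can be registered on it.
	grpcInit := grpc.NewServer()
	serverpb.RegisterInitServer(grpcInit, s)

	go func() {
		// cmux routes all connections here until the node is bootstrapped.
		if err := grpcInit.Serve(ln); err != nil {
			log.Infof(ctx, "init server exited: %v", err)
		}
	}()

	go func() {
		// Tear the dedicated server down once bootstrap completes so it
		// cannot linger past the initialization phase.
		<-s.bootstrapped
		grpcInit.Stop()
	}()
}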


pkg/server/init.go, line 51 at r7 (raw file):

func (s *initServer) awaitBootstrap() error {
	{

Remove these braces; they don't mean anything (in Go, braces define variable scopes but not defer scopes).
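A tiny runnable example of the point being made here: a deferred call runs when the enclosing function returns, not when the surrounding brace block ends.

package main

import "fmt"

func main() {
	{
		defer fmt.Println("deferred")
		fmt.Println("inside block")
	}
	fmt.Println("after block")
	// Prints:
	//   inside block
	//   after block
	//   deferred   <- runs at function exit, not at the closing brace
}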


pkg/server/init.go, line 53 at r7 (raw file):

	{
		s.mux.Lock()
		s.waiting = true

We should probably set waiting to false when this method exits (in a deferred function), just as an extra sanity check.
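One way to express that sanity check, sketched against the mu-struct layout suggested above; the wait itself is elided and purely illustrative:

func (s *initServer) awaitBootstrap() error {
	s.mu.Lock()
	s.mu.waiting = true
	s.mu.Unlock()
	// Extra sanity check: clear the flag on every exit path.
	defer func() {
		s.mu.Lock()
		s.mu.waiting = false
		s.mu.Unlock()
	}()

	<-s.bootstrapped // hypothetical wait; the real method may select on more channels
	return nil
}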


pkg/server/server.go, line 542 at r7 (raw file):

	readyToServe := int32(0)
	initL := m.Match(func(_ io.Reader) bool {
		return atomic.LoadInt32(&readyToServe) == 0

Ah, this is a clever approach to the problem. I'd rather not expose the initServer at all if we're already bootstrapped, though. That would mean keeping serveOnMux and calling it in two places, both at the end where it was originally and just before awaitBootstrap.
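For readers unfamiliar with cmux, here is a self-contained sketch of the flag-gated matcher technique under discussion; the helper name and listener plumbing are made up:

package example

import (
	"io"
	"net"
	"sync/atomic"

	"github.com/soheilhy/cmux"
)

// splitListeners claims every connection for the init listener until the
// returned flag is set to 1, after which the matcher rejects everything
// and the init listener is effectively removed from the mux.
func splitListeners(ln net.Listener) (net.Listener, net.Listener, *int32) {
	m := cmux.New(ln)
	var readyToServe int32
	initL := m.Match(func(_ io.Reader) bool {
		return atomic.LoadInt32(&readyToServe) == 0
	})
	anyL := m.Match(cmux.Any())
	go func() { _ = m.Serve() }()
	return initL, anyL, &readyToServe
}

// Once the node is bootstrapped, the caller flips the flag:
//   atomic.StoreInt32(readyPtr, 1)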


pkg/server/server.go, line 599 at r7 (raw file):

	s.stopper.RunWorker(workersCtx, func(context.Context) {
		<-s.stopper.ShouldQuiesce()
		netutil.FatalIfUnexpected(anyL.Close()) // TODO: Do we need to also close the other listeners?

Yeah, it seems like we should close all of them or none of them here. I'm not sure why we only close this one.


pkg/server/server.go, line 792 at r7 (raw file):

	// Set the readyToServe bit which removes the initL from cmux.
	if !atomic.CompareAndSwapInt32(&readyToServe, 0, 1) {

This could be atomic.StoreInt32(&readyToServe, 1). There are no other conflicting changes that could be happening.


pkg/server/serverpb/init.proto, line 1 at r7 (raw file):

// Copyright 2016 The Cockroach Authors.

s/2016/2017/


Comments from Reviewable

@bdarnell
Contributor

I'm taking over this PR; I've added some tests and done a little cleaning up. There's still a little more I want to do (mainly figure out why the acceptance tests fail if I run the bootstrap against a node other than zero), but I think it's pretty much ready to go.

@petermattis
Collaborator

Review status: 1 of 16 files reviewed at latest revision, 14 unresolved discussions, some commit checks failed.


pkg/cli/init.go, line 30 at r10 (raw file):

	Use:   "init",
	Short: "TODO",
	Long:  "TODO",

Reminder to fill these in.


pkg/cli/init.go, line 44 at r10 (raw file):

	request := &serverpb.BootstrapRequest{}
	_, err = c.Bootstrap(ctx, request)
	if err != nil {

Maybe tighten this up a small bit:

if _, err := c.Bootstrap(ctx, &serverpb.BootstrapRequest{}); err != nil {

pkg/server/server.go, line 581 at r10 (raw file):

	// Inject an initialization listener that can optionally intercept
	// all connections.

that will intercept all connections while the cluster is initializing.


Comments from Reviewable

@bdarnell bdarnell force-pushed the 5974_revisit_cluster_init branch 5 times, most recently from dadba7b to ee3bf84 on July 22, 2017 03:12
@bdarnell
Contributor

(mainly figure out why the acceptance tests fail if I run the bootstrap against a node other than zero)

The reason this was failing is that nodes don't start their gossip server until they're initialized. I've made an attempt to change this, but it runs into difficulties at the RPC layer (we need to stop and restart the RPC server to switch from init to regular mode, and make sure that all the long-lived gossip connections either stay connected or stop and restart cleanly) and inside gossip (the server side of gossip assumes the cluster has been initialized before the server starts).

So we can either
A) Live with the limitation that the init command must go to nodes that are join targets, instead of an arbitrary node
B) Fix the above issues
C) Introduce a second RPC in the init service to tell a join target to stop waiting for the init command

I'm leaning towards option A. @a-robinson, would that interfere with the use of the init command in kubernetes/swarm/etc deployments?

@petermattis
Collaborator

Option A is likely the easiest, but it feels a bit unsatisfactory. I'm a little out of date with regard to the node initialization sequence. Can you expand on what goes wrong when the init RPC is sent to a node other than node zero?

@a-robinson
Contributor

Can you clarify what you mean by "nodes that are join targets, instead of an arbitrary node"? Does this mean that your join configuration effectively needs to be a DAG where all nodes can reach the node that you send the init command to?

@bdarnell
Contributor

Can you expand on what goes wrong when the init RPC is sent to a node other than node zero?

In the current implementation, a node is either serving the init method or the regular rpc interface, but not both at the same time. When node zero (the join target) is initialized, it switches to the regular rpc interface and the other nodes can join it. When another node is initialized, it tries to join node zero, but it can't because the gossip interface is not active yet. Option B is to enable the gossip interface earlier (and deal with the issues that arise because we have not previously had an active gossip server with an unknown cluster ID); option C is to handle this by extending the initialization rpc interface.

Does this mean that your join configuration effectively needs to be a DAG where all nodes can reach the node that you send the init command to?

Not just a DAG. In fact adding a cycle is one way to fix the problem. If for example all your nodes point to nodes 1, 2 and 3 as join targets, then you must init one of those three nodes. If your join flags describe a ring (node N joins node N-1 modulo num_nodes), you can init any node.

The status quo has similar constraints on the join graph: every node must be able to reach the node that was started without a join flag. The new system requires that every node be able to reach the init target. The question is just whether that undermines some of the flexibility we hoped to gain with the init command.

@a-robinson
Contributor

a-robinson commented Jul 24, 2017

Not just a DAG. In fact adding a cycle is one way to fix the problem. If for example all your nodes point to nodes 1, 2 and 3 as join targets, then you must init one of those three nodes. If your join flags describe a ring (node N joins node N-1 modulo num_nodes), you can init any node.

The status quo has similar constraints on the join graph: every node must be able to reach the node that was started without a join flag. The new system requires that every node be able to reach the init target. The question is just whether that undermines some of the flexibility we hoped to gain with the init command.

Ah, of course, it needs to be fully connected, not acyclic (as in #13027).

Option A does feel a bit half-hearted, but it's just imposing a restriction that we could later open up to improve usability. This is how it'll affect each orchestration tool:

  • Kubernetes: It'll work fine, and in the process slightly simplify the config and eliminate one very unlikely edge case of node 1 creating a new cluster ID if it loses its disk and gets restarted while all the other processes are down.
  • Mesos: It'll work fine, but won't simplify anything (due to some of Mesos's more sophisticated options we're taking advantage of).
  • Swarm: It'll simplify our current deployment instructions, but won't work with a possible simplification of our insecure deployment (docs: docker swarm service unique service name #17133) due to the containers not having known names in advance. That's not a huge deal, though, given that that approach was only going to be practical for insecure deployments in the short term anyway.

I assume that for option C what you're proposing is that when a node gets initialized it will effectively forward on its new cluster ID via an RPC on the Init service to all targets it's trying to join? I actually like that quite a bit. It'd be fairly isolated from everything else. Option B, on the other hand, would be quite tricky to get right.

@bdarnell
Contributor

I assume that for option C what you're proposing is that when a node gets initialized it will effectively forward on its new cluster ID via an RPC on the Init service to all targets it's trying to join?

Yes, that's the idea. It's a more isolated solution than option B, but there's still a little plumbing complexity to use the gossip bootstrap targets in this new context (and it gets tricky if the join target is a load balancer or DNS alias, although now that I think about it those cases would probably also have difficulty forming a fully connected graph).

@a-robinson
Contributor

What happened to the goal from the RFC of verifying that the specified number of nodes are connected? That would appear not to work without something a little more sophisticated (like option B).

@bdarnell
Contributor

We decided to drop that from the RFC because it seemed like unnecessary complexity.

@a-robinson
Contributor

That's what I thought, but then I got confused by seeing it still in the summary of the RFC. I'll change it.

but there's still a little plumbing complexity to use the gossip bootstrap targets in this new context

Is that so? Would a node that needs to be initialized have any bootstrap targets other than what was specified on the command line?

Anyways, I take it this is ready to review? If so I'll take a look at it within the next day or so. I'd like to see option C get implemented if we have time, but it seems like it'll just be additive so discussing it shouldn't block reviewing option A if that's ready.

@bdarnell
Contributor

Is that so? Would a node that needs to be initialized have any bootstrap targets other than what was specified on the command line?

Just the ones on the command line. It's just a little annoying to use them because we have to work outside the regular connection pool (which wants heartbeats to work, etc).

Anyway, yes, this is ready to review. We can implement option C later if we have time and/or it seems to be useful.

@a-robinson a-robinson self-requested a review July 26, 2017 15:32
@a-robinson
Contributor

Nice, this turned out very clean!


Reviewed 1 of 2 files at r1, 1 of 7 files at r6, 2 of 8 files at r8, 3 of 8 files at r9, 4 of 9 files at r10, 6 of 6 files at r11.
Review status: all files reviewed at latest revision, 22 unresolved discussions, some commit checks failed.


pkg/cli/init.go, line 33 at r11 (raw file):

Perform one-time-only initialization of a CockroachDB cluster.

After starting one or more nodes with --join flags, run the init

The --join requirements needed for init to work should be included in the description here.


pkg/server/init.go, line 33 at r7 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

s/mux/mu/

If the mutex only protects waiting, we like to use this anonymous struct pattern:

mu struct {
    sync.Mutex
    waiting bool
}

But if it's held for the entire duration of Bootstrap, it might be better not to do that to highlight the fact that it protects more than just the variables it contains.

It looks like this never got changed, but it should


pkg/server/init.go, line 56 at r11 (raw file):

waiting: false

Not really needed/idiomatic


pkg/server/init.go, line 94 at r11 (raw file):

	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.waiting {

Isn't there an inherent race between when we start serving and when we start awaiting bootstrap? It'd be safer to just initialize waiting to true in newInitServer and set it to false in awaitBootstrap.
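A sketch of the alternative being suggested here, assuming the mu-struct layout from earlier; the constructor name and body are guesses, not the PR's code:

// Hypothetical constructor: waiting starts true so that a Bootstrap RPC
// arriving before awaitBootstrap runs is still accepted.
func newInitServer(s *Server) *initServer {
	is := &initServer{
		server:       s,
		bootstrapped: make(chan struct{}),
	}
	is.mu.waiting = true
	return is
}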


pkg/server/init.go, line 95 at r11 (raw file):

	defer s.mu.Unlock()
	if !s.waiting {
		return nil, errors.New("Init not expecting Bootstrap")

Since this is going to be an error that new users may see when they're failing to set up their cluster properly, something with a little more explanation would be nice.


pkg/server/init.go, line 99 at r11 (raw file):

	if err := s.server.node.bootstrap(ctx, s.server.engines); err != nil {
		log.Error(ctx, "Node bootstrap failed: ", err)

s/Node/node for consistency with the rest of our log messages


pkg/server/node.go, line 333 at r11 (raw file):

func (n *Node) bootstrap(ctx context.Context, engines []engine.Engine) error {
	n.initialBoot = true
	// This node has no initialized stores and no way to connect to

This comment isn't really applicable anymore.


pkg/server/server.go, line 775 at r11 (raw file):

		log.Infof(ctx, "**** add additional nodes by specifying --join=%s", s.cfg.AdvertiseAddr)
	} else {
		log.Info(ctx, "No stores bootstrapped and --join flag specified, starting Init Server.")

s/No/no


pkg/server/server.go, line 804 at r11 (raw file):

	if err != nil {
		log.Error(ctx, "Node.start error: ", err)

Did you mean to leave this in? The error should already be getting printed out by the caller, shouldn't it?


Comments from Reviewable

@bdarnell bdarnell force-pushed the 5974_revisit_cluster_init branch from ee3bf84 to 6dce56a on July 26, 2017 21:27
@bdarnell
Contributor

Review status: 13 of 16 files reviewed at latest revision, 22 unresolved discussions, some commit checks failed.


pkg/server/init.go, line 33 at r7 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

It looks like this never got changed, but it should

Done


pkg/server/init.go, line 56 at r11 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

waiting: false

Not really needed/idiomatic

Done


pkg/server/init.go, line 94 at r11 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Isn't there an inherent race between when we start serving and when we start awaiting bootstrap? It'd be safer to just initialize waiting to true in newInitServer and set it to false in awaitBootstrap.

Renamed the variable to awaitDone so it can keep a zero default.


pkg/server/init.go, line 95 at r11 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Since this is going to be an error that new users may see when they're failing to set up their cluster properly, something with a little more explanation would be nice.

Changed to "bootstrap called after initialization complete". Better suggestions welcome.


pkg/server/init.go, line 99 at r11 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

s/Node/node for consistency with the rest of our log messages

Done


pkg/server/node.go, line 333 at r11 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

This comment isn't really applicable anymore.

Removed


pkg/server/server.go, line 581 at r10 (raw file):

Previously, petermattis (Peter Mattis) wrote…

that will intercept all connections while the cluster is initializing.

Done


pkg/server/server.go, line 775 at r11 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

s/No/no

Done


pkg/server/server.go, line 804 at r11 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Did you mean to leave this in? The error should already be getting printed out by the caller, shouldn't it?

Removed


Comments from Reviewable

@a-robinson
Contributor

:lgtm:


Reviewed 3 of 3 files at r12.
Review status: all files reviewed at latest revision, 19 unresolved discussions, some commit checks failed.


pkg/cli/init.go, line 33 at r11 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

The --join requirements needed for init to work should be included in the description here.

ping


pkg/server/init.go, line 95 at r11 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Changed to "bootstrap called after initialization complete". Better suggestions welcome.

I might tack a "cluster has already been initialized" on the end, but this works.


pkg/server/init.go, line 78 at r12 (raw file):

	}
	s.mu.Lock()
	s.mu.awaitDone = false

s/false/true/


Comments from Reviewable

@bdarnell bdarnell force-pushed the 5974_revisit_cluster_init branch from 6dce56a to 86d5e41 on July 26, 2017 21:47
@bdarnell
Contributor

Review status: 14 of 16 files reviewed at latest revision, 19 unresolved discussions, some commit checks failed.


pkg/cli/init.go, line 33 at r11 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

ping

Done


pkg/server/init.go, line 95 at r11 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

I might tack a "cluster has already been initialized" on the end, but this works.

Done. Note that there's probably a lot more work to be done on error messages here. This error is only returned on a very narrow race condition; if init is called on a node that is fully initialized and transitioned to serving mode, the error that comes back is something like a 404 because the init server is not even running.


pkg/server/init.go, line 78 at r12 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

s/false/true/

Oops. Done


Comments from Reviewable

adamgee and others added 3 commits July 26, 2017 21:51
--join flag is passed.

Use cmux to initially swallow all connections into the `initL` which is
served by the initServer.  Only the Init grpc interface is registered
so all other connections will get automatically closed.  Once cockroach
is ready to serve, flip the initL matcher to match nothing, effectively
removing the initServer.

Add `init` cli command.